What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?
This is a great place to start digging in to those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.
# importing sweetviz
#import sys
#!{sys.executable} -m pip install sweetviz;
# importing plotly
import sys
!{sys.executable} -m pip install plotly_express
import pandas as pd
import numpy as np
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.graph_objects as go
#import sweetviz as sv
df_orig = pd.read_csv("tmdb_movies.csv")
df_movies = df_orig.copy()
df_movies.set_index('id',inplace=True)
df_orig.head()
df_orig.shape
# checking columns data types and missing values
df_orig.info()
# showing statistics
df_orig.describe()
Insight
If the SD is more then ONE THIRD OF Arithmetic mean then it is considered high; therefore both popularity, budget_adj, revenue_adj and vote_count contains significant outliers.
This can be displayed in the following histograms
fig = px.histogram(df_orig, x="popularity")
fig.show()
fig = px.histogram(df_orig, x="budget_adj")
fig.show()
fig = px.histogram(df_orig, x="revenue_adj")
fig.show()
fig = px.histogram(df_orig, x="vote_count")
fig.show()
#Here I was trying sweetviz but didn't like it
#report = sv.analyze(df_orig, "revenue")
#report.show_html()
Removing unneeded columns; we won't need these columns to answer our questions:
id
imdb_id
original_title
homepage tagline
overview
budget (we will use budget_adj for to account for inflatation over time)
revenue (we will use revenue_adj for to account for inflatation over time)
release_date
runtime
keywords
vote_count
# removing unneeded columns
df_movies.drop(['imdb_id','homepage','tagline','overview','budget','revenue','release_date','runtime','keywords','vote_count'],axis=1,inplace=True)
df_movies.shape
df_movies.head()
Dealing with null values; columns with Missing Values:
cast
director
keywords
genres
production_companies
# checking for any null values
df_null = df_movies[df_movies.isna().any(axis=1)]
df_null.shape
#removing null values
df_movies.dropna(axis=0,how="any",inplace=True)
Excluding outliers into a separate df; columns with significant outliers
popularity
budget_adj
revenue_adj
vote_count
Using IQR, we can follow the below approach to exclude the outliers into a separate df:
1 Calculate the first and third quartile (Q1 and Q3).
2 Further, evaluate the interquartile range, IQR = Q3-Q1.
3 Estimate the lower bound, lower_bound= Q1 - 1.5 x IQR
4 Estimate the upper bound, upper_bound= Q3 + 1.5 x IQR
5 Exclude all data points that lie outside of the lower and the upper bound into a separate df.
# popularity
Q1 = df_movies.popularity.quantile(0.25)
Q3 = df_movies.popularity.quantile(0.75)
IQR = Q3-Q1
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR
df_outliers_popularity = df_movies[(df_movies['popularity'] > upper_bound) | (df_movies['popularity'] < lower_bound)]
df_outliers_popularity.shape
# budget_adj
Q1 = df_movies.budget_adj.quantile(0.25)
Q3 = df_movies.budget_adj.quantile(0.75)
IQR = Q3-Q1
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR
df_outliers_budget_adj = df_movies[(df_movies['budget_adj'] > upper_bound) | (df_movies['budget_adj'] < lower_bound)]
df_outliers_budget_adj.shape
# revenue_adj
Q1 = df_movies.revenue_adj.quantile(0.25)
Q3 = df_movies.revenue_adj.quantile(0.75)
IQR = Q3-Q1
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR
df_outliers_revenue_adj = df_movies[(df_movies['revenue_adj'] > upper_bound) | (df_movies['revenue_adj'] < lower_bound)]
df_outliers_revenue_adj.shape
df_movies.shape
df_outliers = pd.concat([df_outliers_popularity, df_outliers_budget_adj, df_outliers_revenue_adj])
df_outliers['id']=df_outliers.index
df_outliers.drop_duplicates('id',inplace=True)
df_outliers.shape
df_movies.drop(df_outliers.index,inplace= True)
df_movies.shape
df_movies.duplicated().sum()
df_movies.drop_duplicates(inplace=True)
df_movies.shape
Here we are gonna check and fix data types issues as well as reformat our data structure
df_movies.info()
df_movies.head(3)
# changing release_year to int
df_movies['release_year'] = pd.to_numeric(df_movies['release_year'] , downcast='integer')
df_movies.info()
df_movies.shape
# solving problem with genres column
df_movies['genres_adj']= df_movies['genres'].str.split("|")
# getting unique values of genres column for explatory section
genres = df_movies.genres_adj.values
genres = np.hstack(genres)
genres_unique , genres_values_count = np.unique(genres,return_counts=True)
genres_dict = dict(zip(genres_unique, genres_values_count))
# dropping genres column
df_movies.drop(['genres'],axis=1,inplace=True)
df_movies.head(0)
# solving problem with director column
df_movies['director_adj']= df_movies['director'].str.split("|")
# getting unique values of director column for explatory section
directors = df_movies.director_adj.values
directors = np.hstack(directors)
directors_unique , directors_values_count = np.unique(directors,return_counts=True)
directors_dict = dict(zip(directors_unique, directors_values_count))
# dropping director column
df_movies.drop(['director'],axis=1,inplace=True)
df_movies.head(0)
# solving problem with cast column
df_movies['cast_adj']= df_movies['cast'].str.split("|")
# getting unique values of cast column for explatory section
cast = df_movies.cast_adj.values
cast = np.hstack(cast)
cast_unique , cast_values_count = np.unique(cast,return_counts=True)
cast_dict = dict(zip(cast_unique, cast_values_count))
# dropping cast column
df_movies.drop(['cast'],axis=1,inplace=True)
df_movies.head(0)
# solving problem with production_companies column
df_movies['production_companies_adj']= df_movies['production_companies'].str.split("|")
# getting unique values of production_companies for explatory section
production_companies = df_movies.production_companies_adj.values
production_companies = np.hstack(production_companies)
production_companies_unique , production_companies_values_count = np.unique(production_companies,return_counts=True)
production_companies_dict = dict(zip(production_companies_unique, production_companies_values_count))
# dropping production_companies column
df_movies.drop(['production_companies'],axis=1,inplace=True)
df_movies.head(0)
popularity
budget_adj
revenue_adj
vote_count
print(df_outliers.hist(figsize=(10,10)))
fig = px.scatter(df_outliers, x="budget_adj" ,y="revenue_adj")
fig.show()
There is no coorelation between budget and revenue in these outliers; although there is a positive coorelation between vote_count and popularity
Conclusion
Most of these outliers have very low popularity,revenue and buget but recent release_years
print(df_null.hist(figsize=(10,10)))
Rows having null values have similar distribution to outliers but even have lowest popularity,revenue,buget and vote count but more recent release_years than null values
Questions:
1 what attributes affect revenue most?
2 what attributes affect profit most?
3 How did people preference's of film's genres changed over the last 10 years? 4 what was the most profitable movie of the last 10 year??
5 what is the highest grossing movie of all time?
6 what production companies produces most films?
7 what cast members are most popular?
8 who are the directors that direct most films?
df_movies.head(0)
fig = px.scatter(df_movies, x="vote_average" ,y="revenue_adj")
fig.show()
fig = px.scatter(df_movies, x="release_year" ,y="revenue_adj")
fig.show()
fig = px.scatter(df_movies, x="budget_adj" ,y="revenue_adj")
fig.show()
No coorelation
df_movies.reset_index(drop=True, inplace=True)
Total_revenue = []
for genre in genres_unique:
def calc_total_revenue(genres_adj,revenue_adj):
if genre in genres_adj:
return revenue_adj
revenue = df_movies.apply(lambda row: calc_total_revenue(row['genres_adj'],row['revenue_adj']), axis=1)
Total_revenue.append(revenue.sum())
plt.figure(figsize=(25,10))
plt.bar(genres_unique, Total_revenue)
plt.title('Total Revenue gained by each genre')
plt.xlabel('Movie Genres')
plt.ylabel('Total Revenue')
TV movies have lowest revenue while Drama movies have highest revenue
df_movies['profit'] = df_movies.revenue_adj - df_movies.budget_adj
fig = px.scatter(df_movies, x="vote_average" ,y="profit")
fig.show()
There is a slightly positive coorelation between vote average and profit
fig = px.scatter(df_movies, x="release_year" ,y="profit")
fig.show()
no coorelation between release year and profit
fig = px.scatter(df_movies, x="budget_adj" ,y="profit")
fig.show()
There is a negative coorelation between budget and profit
Total_profit = []
for genre in genres_unique:
def calc_total_profit(genres_adj,profit):
if genre in genres_adj:
return profit
profit = df_movies.apply(lambda row: calc_total_profit(row['genres_adj'],row['profit']), axis=1)
Total_profit.append(profit.sum())
plt.figure(figsize=(25,10))
plt.bar(genres_unique, Total_profit)
plt.title('Total Profit gained by each genre')
plt.xlabel('Movie Genres')
plt.ylabel('Total profit')
TV movies and western movies have lowest profit while comedy movies have highest profit
years = np.arange(2006,2016)
zeros = np.zeros((len(genres_unique),len(years)), dtype=int)
temp_dict_y = dict(zip(genres_unique,zeros))
def count_genres(genres_list,year):
for g in genres_list:
temp_dict_y[g][year%2006]=temp_dict_y[g][year%2006]+1
for year in years:
df_temp = df_movies[df_movies['release_year']==year]
df_temp.apply(lambda row:count_genres(row['genres_adj'],year), axis=1)
d = []
for genre in genres_unique:
d.append(go.Bar(name=genre, x=years, y=temp_dict_y[genre]))
fig = go.Figure(data=d)
fig.update_layout(barmode='group')
fig.show()
Drama has been the most common type of movies from 2006 to 2015 while western genre has been the least common types of movies in that period
x = years
fig = go.Figure()
for genre in temp_dict_y.keys():
fig.add_trace(go.Scatter(
x=x,
y=temp_dict_y[genre],
name = genre, # Style name/legend entry with html tags
connectgaps=True # override default to connect the gaps
))
fig.show()
from 2014 thriller movies beacame more common than comedy
profit = []
Years_adj = []
for year in years:
net = df_movies[df_movies['release_year']==year].profit.max()
name = df_movies[df_movies['profit']==net].original_title.values[0]
Years_adj.append(str(name)+" ("+str(year)+")")
profit.append(net)
plt.figure(figsize=(45,25))
plt.bar(Years_adj, profit)
plt.title('Total Profit gained by each movie in the last 10 years')
plt.xlabel('Movie name (year)')
plt.ylabel('profit')
max_revenue = df_movies.revenue_adj.max()
movie = df_movies[df_movies.revenue_adj == max_revenue]
movie.original_title.values[0]
i = production_companies_values_count.tolist().index(production_companies_values_count.max())
production_companies_unique[i]
counts = []
names=[]
for name in production_companies_unique:
count = production_companies_dict[name]
if count > 100:
counts.append(count)
names.append(name)
fig = px.pie(values=counts, names=names, title='percentage of movies produced by each production company')
fig.show()
plt.figure(figsize=(35,15))
plt.bar(names, counts)
plt.title('Total number of films produced by each production company')
plt.xlabel('production company')
plt.ylabel('Number of films')
i = cast_values_count.tolist().index(cast_values_count.max())
cast_unique[i]
counts = []
names=[]
for name in cast_unique:
count = cast_dict[name]
if count > 35:
counts.append(count)
names.append(name)
fig = px.pie(values=counts, names=names, title='percentage of movies produced by each production company')
fig.show()
plt.figure(figsize=(35,15))
plt.bar(names, counts)
plt.title('Total number of films acted by top cast')
plt.xlabel('cast name')
plt.ylabel('Number of films')
i = directors_values_count.tolist().index(directors_values_count.max())
directors_unique[i]
counts = []
names=[]
for name in directors_unique:
count = directors_dict[name]
if count > 15:
counts.append(count)
names.append(name)
fig = px.pie(values=counts, names=names, title='percentage of movies produced by each production company')
fig.show()
plt.figure(figsize=(35,15))
plt.bar(names, counts)
plt.title('Total number of films directed by directors')
plt.xlabel('director name')
plt.ylabel('Number of films')
- movies with lowest budget have lower revenues
- TV movies have lowest revenue while Drama movies have highest revenue
- There is a slightly positive coorelation between vote average and profit
- There is a negative coorelation between budget and profit
- TV movies and western movies have lowest profit while comedy movies have highest profit
- Drama has been the most common type of movies from 2006 to 2015 while western genre has been the least common types of movies in that period
- from 2014 thriller movies beacame more common than comedy
- The Black Hole highest grossing movie of all time
- Universal Pictures produces most films
- Samuel L. Jackson acted in most films
- Woody Allen directed most films
Data contains lot of outliers, many rows contain missing values or zeros while other rows have very large numbers